
Clipping Outliers: NYC Airbnb pricing dataset analysis¶

Problem Description¶

To illustrate the clipping method, let's look at an example.

We have a dataset named nyc_airbnb.csv, which contains data about the per-night price of Airbnb rentals in New York City. The price column contains some outliers. Our task is to find these outliers and handle them by winsorizing (clipping), that is, by replacing extreme values with a boundary value instead of dropping the rows.
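Before turning to the real data, here is a minimal sketch of what clipping does. The prices and the bounds 0 and 300 below are hypothetical toy values chosen for illustration; they are not taken from nyc_airbnb.csv.

```python
import pandas as pd

# Hypothetical toy prices (not from nyc_airbnb.csv); 5000 plays the outlier
toy_prices = pd.Series([50, 80, 120, 150, 5000])

# Values below `lower` are raised to it, values above `upper` are lowered to it;
# everything in between is left unchanged
clipped = toy_prices.clip(lower=0, upper=300)

print(clipped.tolist())
```

Note that no rows are removed: the series keeps its length, and only the extreme value is moved to the boundary.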

First, we load the nyc_airbnb.csv dataset into a DataFrame and view it.

Load the Dataset and View data¶


In [2]:
import pandas as pd
nyc=pd.read_csv("../datasets/nyc_airbnb.csv")
In [3]:
nyc
Out[3]:
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48890 | 36484665 | Charming one bedroom - newly renovated rowhouse | 8232441 | Sabrina | Brooklyn | Bedford-Stuyvesant | 40.67853 | -73.94995 | Private room | 70 | 2 | 0 | NaN | NaN | 2 | 9 |
| 48891 | 36485057 | Affordable room in Bushwick/East Williamsburg | 6570630 | Marisol | Brooklyn | Bushwick | 40.70184 | -73.93317 | Private room | 40 | 4 | 0 | NaN | NaN | 2 | 36 |
| 48892 | 36485431 | Sunny Studio at Historical Neighborhood | 23492952 | Ilgar & Aysel | Manhattan | Harlem | 40.81475 | -73.94867 | Entire home/apt | 115 | 10 | 0 | NaN | NaN | 1 | 27 |
| 48893 | 36485609 | 43rd St. Time Square-cozy single bed | 30985759 | Taz | Manhattan | Hell's Kitchen | 40.75751 | -73.99112 | Shared room | 55 | 1 | 0 | NaN | NaN | 6 | 2 |
| 48894 | 36487245 | Trendy duplex in the very heart of Hell's Kitchen | 68119814 | Christophe | Manhattan | Hell's Kitchen | 40.76404 | -73.98933 | Private room | 90 | 7 | 0 | NaN | NaN | 1 | 23 |

48895 rows × 16 columns

Check for outliers in price data¶

Plot a strip plot for outlier estimation:¶


In [4]:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
In [5]:
price_strip = px.strip(nyc, y='price')

price_strip.show()

Use boxplot to estimate outliers:¶


In [6]:
box_price = nyc.boxplot(column='price', figsize=(15,5), fontsize=10, vert=False)

box_price
Out[6]:
<AxesSubplot:>

Observe price distribution in terms of numbers:¶

In [7]:
nyc['price'].describe()
Out[7]:
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

Find the Outliers:¶

Calculate Q3:¶

In [8]:
## find the 75th percentile value
q3= nyc['price'].quantile(0.75)

print("q3:",q3)
q3: 175.0

Calculate Q1:¶

In [9]:
## find the 25th percentile value
q1= nyc['price'].quantile(0.25)

print("q1:",q1)
q1: 69.0

Find the interquartile range (IQR):¶

In [10]:
iqr= q3 - q1

print("iqr:",iqr)
iqr: 106.0

Calculate the upper and lower bound for outliers:¶

In [11]:
upper_bound= q3 + 1.5*iqr

print("upper bound",upper_bound)

lower_bound= q1 - 1.5*iqr

print("lower bound",lower_bound)
upper bound 334.0
lower bound -90.0
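The four steps above (Q1, Q3, IQR, bounds) can be wrapped in one small helper. The sketch below is an illustration only: the function name `iqr_bounds`, its parameter `k`, and the toy series are assumptions, not part of the notebook. The toy values were chosen so that their quartiles match the ones computed above (69 and 175).

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the k * IQR rule.

    Hypothetical helper for illustration; `k` is the usual 1.5 multiplier.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy prices (assumed, not from nyc_airbnb.csv) with the same quartiles
toy = pd.Series([0, 69, 69, 106, 175, 175, 10000])

lower, upper = iqr_bounds(toy)

# Anything strictly outside the bounds is flagged as an outlier
outliers = toy[(toy < lower) | (toy > upper)]

print("lower:", lower, "upper:", upper)
print("outliers:", outliers.tolist())
```

On the real data, the same mask applied to nyc['price'] would show how many listings fall above the upper bound.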

Clip the Outliers:¶

Find the clipping points:¶

Restrict the bounds to the observed price range:¶

In [12]:
lower_point= max(lower_bound,nyc['price'].min())

print("lower_point", lower_point)

upper_point= min(upper_bound,nyc['price'].max())

print("upper_point", upper_point)
lower_point 0
upper_point 334.0

Clip outliers using the clipping points:¶

In [13]:
nyc['price'] = nyc['price'].clip(lower_point, upper_point)
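To make the design choice explicit, the sketch below contrasts clipping with the alternative of filtering out rows beyond the bounds. It uses a hypothetical toy DataFrame rather than the real dataset; the clipping points 0 and 334.0 are the ones computed above.

```python
import pandas as pd

# Toy frame (assumed values, not from nyc_airbnb.csv)
toy = pd.DataFrame({"price": [0, 69, 106, 175, 10000]})
lower_point, upper_point = 0, 334.0

# Clipping: every row is kept, the extreme price is capped at upper_point
clipped = toy["price"].clip(lower=lower_point, upper=upper_point)

# Filtering: rows outside [lower_point, upper_point] are dropped entirely
filtered = toy[toy["price"].between(lower_point, upper_point)]

print("rows after clipping: ", len(clipped))
print("rows after filtering:", len(filtered))
print("max after clipping:  ", clipped.max())
```

Clipping preserves the other columns of an expensive listing at the cost of understating its price; filtering discards the listing altogether.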

Check final clipped data:¶

In [14]:
nyc['price'].describe()
Out[14]:
count    48895.000000
mean       132.979753
std         83.530504
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max        334.000000
Name: price, dtype: float64
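In the summary above, the maximum, mean, and standard deviation change, while the quartiles stay at 69, 106, and 175. That is expected: clipping only moves values that were already beyond the bounds, so the order of the data around the quartiles is untouched. A small sketch with assumed toy values (not the real dataset) shows the same effect and counts how many values were changed.

```python
import pandas as pd

# Toy prices (assumed) before and after clipping at the bounds used above
before = pd.Series([0, 69, 69, 106, 175, 175, 10000])
after = before.clip(lower=0, upper=334)

# Number of values that clipping actually modified
changed = int((before != after).sum())

# Quartiles before and after clipping
quartiles_before = before.quantile([0.25, 0.5, 0.75]).tolist()
quartiles_after = after.quantile([0.25, 0.5, 0.75]).tolist()

print("values changed:", changed, "of", len(before))
print("quartiles unchanged:", quartiles_before == quartiles_after)
print("mean before:", before.mean(), "mean after:", after.mean())
```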

Distribution after clipping data¶

Using strip plot:¶

In [15]:
# Strip plot of the price column after clipping

final= px.strip(nyc, y='price')
final.show()

Using boxplot:¶

In [16]:
box_price2= nyc.boxplot(column='price', figsize=(10,5), fontsize=8, vert=False)
box_price2
Out[16]:
<AxesSubplot:>

Conclusion¶

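By using the clip method, we have capped the outliers in the price data at the IQR-based bounds rather than removing them: all 48895 rows are kept, the maximum price drops from 10000 to 334, the mean from about 152.72 to 132.98, and the standard deviation from about 240.15 to 83.53, while the quartiles are unchanged. A model of Airbnb prices trained on the clipped column should be less dominated by a few extreme listings, although how much this improves predictions still has to be checked on the model itself.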